Introduction

This is the third installment of Applying Machine Learning to Kaggle Datasets, a series of IPython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each notebook, I apply one method taught in the course to an open Kaggle competition.

In this notebook, I demonstrate the use of an artificial neural network in the Titanic competition.

Outline

Import and examine the data
Create input vectors for the neural network
Set up the network using the neurolab library in Python
Evaluate model results
Submit results to the Kaggle competition

Import Necessary Modules



In [122]:

    
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import code.Neural_Net_Funcs as NNF
import neurolab as nl



In [123]:

    
reload(NNF)









    Out[123]:





<module 'code.Neural_Net_Funcs' from 'code/Neural_Net_Funcs.pyc'>

1. Read Titanic Data



In [124]:

    
train = pd.read_csv("./data/titanic/train.csv", index_col="PassengerId")
train.head()









    Out[124]:






  
    
      
PassengerId  Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                            male    22   1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38   1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26   0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35   1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35   0      0      373450            8.0500   NaN    S



In [125]:

    
#temp = pd.crosstab([train.Pclass, train.Sex],train.Survived.astype(bool))
#temp



In [126]:

    
#sb.set(style="white")
#sb.factorplot('Pclass','Survived','Sex',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Pclass',data=train,palette="muted")
#sb.factorplot('Embarked','Survived','Sex',data=train,palette="muted")
#fg = sb.FacetGrid(train,hue="Pclass",aspect=3,palette="muted")
#fg.map(sb.kdeplot,"Age",bw=4,shade=True,legend=True)
#fg.set(xlim=(0,80))

2. Create input and output vectors for the neural network



In [171]:

    
reload(NNF)
datain_age,dataout_age,min_max_list_age,pid = NNF.make_input_output(train)
datain,dataout,min_max_list,pid = NNF.make_input_output(train,Age=False)
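NNF.make_input_output lives in the author's own module and is not shown in this notebook. Purely as an illustration, the sketch below shows one way a helper of this kind could scale numeric columns to [-1, 1] and build the per-column [min, max] list that nl.net.newff takes as its first argument. The name scale_inputs and the toy columns are hypothetical and are not the author's code.

```python
import numpy as np

def scale_inputs(raw):
    """Hypothetical helper (not NNF.make_input_output): scale each column to [-1, 1]."""
    raw = np.asarray(raw, dtype=float)
    col_min = raw.min(axis=0)
    col_max = raw.max(axis=0)
    # Guard against constant columns to avoid dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    scaled = 2.0 * (raw - col_min) / span - 1.0
    # One [min, max] pair per input column, describing the scaled values.
    min_max = [[float(lo), float(hi)]
               for lo, hi in zip(scaled.min(axis=0), scaled.max(axis=0))]
    return scaled, min_max

# Toy rows (assumed encoding): Pclass, Sex (0 = male, 1 = female), Fare
toy = [[3, 0, 7.25], [1, 1, 71.2833], [3, 1, 7.925]]
toy_in, toy_min_max = scale_inputs(toy)
print(toy_in.shape)
print(toy_min_max)
```

Scaling inputs to the range where tanh is most sensitive is a common choice for networks with tanh activations; whether the author's helper does exactly this is an assumption.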



In [172]:

    
print len(datain_age), len(datain)




3. Set up the neural network using the neurolab library in Python



In [173]:

    
# Get arguments to neurolab net, feed-forward network

# Create the net
# By default, all activation functions are the hyperbolic tangent (TanSig)
# and all layers have a bias node.
#net.trainf = nl.train.train_gdm



In [178]:

    
# Build and train the network on the training data.
m = datain.shape[0]    # number of observations
ci = datain.shape[1]    # number of input nodes
layers = [ci,1]   # One hidden layer with ci nodes, one output node
net = nl.net.newff(min_max_list,layers)
err = net.train(datain, dataout, show=2,goal=0.01,epochs=20)
net.save('myfirst_net_noage.sav')









    



Epoch: 2; Error: 61.0199410394;
Epoch: 4; Error: 60.3566937146;
Epoch: 6; Error: 57.4157801794;
Epoch: 8; Error: 53.8256491684;
Epoch: 10; Error: 46.4204596104;
Epoch: 12; Error: 44.4684620818;
Epoch: 14; Error: 42.1227134574;
Epoch: 16; Error: 41.2205662617;
Epoch: 18; Error: 40.3514747192;
Epoch: 20; Error: 40.0138357294;
The maximum number of train epochs is reached
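The network built above has one hidden layer with as many nodes as inputs and a single output node. As a rough illustration of what one pass through such a network computes, and of why the raw error scales with the number of observations, here is a minimal NumPy sketch. It assumes tanh activations on both layers and a plain sum of squared errors; neurolab's own error function may differ by a constant factor, and the weights here are random rather than trained. All names in the block are made up for the example.

```python
import numpy as np

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by a single tanh output node."""
    hidden = np.tanh(np.dot(x, w_hidden) + b_hidden)
    return np.tanh(np.dot(hidden, w_out) + b_out)

def sum_squared_error(target, output):
    """Sum of squared errors over all samples (grows with the sample count)."""
    return float(np.sum((target - output) ** 2))

rng = np.random.RandomState(0)
n_samples, n_inputs = 8, 3
x = rng.uniform(-1, 1, size=(n_samples, n_inputs))
target = np.sign(rng.uniform(-1, 1, size=(n_samples, 1)))

# Random, untrained weights; the hidden layer has as many nodes as inputs.
w_hidden = rng.normal(size=(n_inputs, n_inputs))
b_hidden = np.zeros(n_inputs)
w_out = rng.normal(size=(n_inputs, 1))
b_out = np.zeros(1)

output = forward(x, w_hidden, b_hidden, w_out, b_out)
sse = sum_squared_error(target, output)
print("SSE: %.3f, per observation: %.3f" % (sse, sse / n_samples))
```

Because the error is a sum over observations, dividing by the number of observations gives a figure that can be compared across data sets of different sizes.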



In [180]:

    
# Train the network on the training data.
m_age = datain_age.shape[0]    # number of observations
ci_age = datain_age.shape[1]    # number of input nodes
layers_age = [ci_age,1]   # One hidden layer with ci_age nodes, one output node
net_age = nl.net.newff(min_max_list_age,layers_age)
err_age = net_age.train(datain_age, dataout_age, show=2,goal=0.01,epochs=20)
net_age.save('myfirst_net_age.sav')









    



Epoch: 2; Error: 217.539895838;
Epoch: 4; Error: 210.2968521;
Epoch: 6; Error: 205.309284427;
Epoch: 8; Error: 199.91987307;
Epoch: 10; Error: 191.917091424;
Epoch: 12; Error: 186.467781263;
Epoch: 14; Error: 183.414792882;
Epoch: 16; Error: 180.644522543;
Epoch: 18; Error: 179.948153752;
Epoch: 20; Error: 178.4347148;
The maximum number of train epochs is reached

4. Evaluate Model Results



In [181]:

    
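# Plot the training error per observation for both networks.
# (plt.hold is unnecessary: successive plot calls draw on the same axes.)
plt.plot(np.array(err)/len(datain),label='No Age')
plt.plot(np.array(err_age)/len(datain_age), label='Age')
plt.legend()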









    Out[181]:





<matplotlib.legend.Legend at 0x10a4d8790>



In [182]:

    
# Print the fraction of training examples classified correctly,
# followed by the number of examples in each group.
trainsim = np.sign(net.sim(datain))
correct = trainsim==dataout
print "Fraction correct (no age): ",np.sum(correct)/ float(len(correct)), len(correct)
trainsim = np.sign(net_age.sim(datain_age))
correct = trainsim==dataout_age
print "Fraction correct (w/ age): ",np.sum(correct)/ float(len(correct)), len(correct)









    



Fraction correct (no age):  0.853107344633 177
Fraction correct (w/ age):  0.831932773109 714
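The fractions above are computed by thresholding the network output with np.sign and comparing against the labels, which appear to be coded as -1 and +1. The sketch below repeats that calculation on made-up numbers. It is an illustration under that label-coding assumption, and the function name fraction_correct is hypothetical.

```python
import numpy as np

def fraction_correct(raw_output, labels):
    """Share of samples whose thresholded output matches the -1/+1 label."""
    # np.sign maps an output of exactly 0 to 0, which matches neither class;
    # mapping values >= 0 to +1 avoids that edge case.
    predicted = np.where(np.asarray(raw_output) >= 0, 1, -1)
    return float(np.mean(predicted == np.asarray(labels)))

toy_output = np.array([[0.8], [-0.3], [0.1], [-0.9]])
toy_labels = np.array([[1], [-1], [-1], [-1]])
print(fraction_correct(toy_output, toy_labels))  # 0.75
```

Note that these fractions are measured on the same data used for training, so they are likely to overstate performance on unseen passengers.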

5. Run test data through networks



In [183]:

    
test = pd.read_csv("./data/titanic/test.csv", index_col="PassengerId")
reload(NNF)









    Out[183]:





<module 'code.Neural_Net_Funcs' from 'code/Neural_Net_Funcs.pyc'>



In [184]:

    
datain_age,dataout_age,min_max_list_age,pid_age = NNF.make_input_output(test,Test=True)



In [185]:

    
datain,dataout,min_max_list,pid = NNF.make_input_output(test,Test=True,Age=False)



In [186]:

    
predict_age = np.sign(net_age.sim(datain_age))
predict_noage = np.sign(net.sim(datain))

6. Submit predictions to the Kaggle competition



In [187]:

    
predictions = np.concatenate([predict_age,predict_noage])
predictions = np.where(predictions==1,predictions,0)
passengerid = np.concatenate([pid_age,pid])
dfout = pd.DataFrame(predictions,index=passengerid,columns=['Survived'])
dfout.index.name = 'PassengerId'
dfout = dfout.astype(int)
dfout = dfout.sort_index()
dfout.to_csv('./predictions/Neural_Network_Prediction.csv',sep=',')
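The cell above merges the two groups of predictions, maps the -1/+1 outputs to 0/1, and orders the rows by passenger id. The same steps can be sketched on toy values without the data files; the ids, predictions, and the helper name to_submission_rows below are made up for the example.

```python
import numpy as np

def to_submission_rows(passenger_ids, signed_predictions):
    """Map -1/+1 predictions to 0/1 and sort the rows by passenger id."""
    ids = np.asarray(passenger_ids).ravel()
    survived = np.where(np.asarray(signed_predictions).ravel() == 1, 1, 0)
    order = np.argsort(ids)
    return [(int(ids[i]), int(survived[i])) for i in order]

# Toy ids and predictions standing in for the two concatenated groups.
rows = to_submission_rows([894, 892, 893], [[1], [-1], [1]])

# Build the CSV text with the header used by the training file's columns.
lines = ["PassengerId,Survived"] + ["%d,%d" % row for row in rows]
print("\n".join(lines))
```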
